runtime: add runtime.Yield() #8

dt · 2025-10-06T22:17:59Z

This change is

petermattis

I'm good with experimenting with this Go runtime enhancement. I definitely want to see experimental evidence of the benefit.

PS Should probably update the print in schedtrace to include sched.bgqsize.

src/runtime/proc.go

src/runtime/runtime2.go

src/runtime/proc.go

sumeerbhola · 2025-10-09T23:04:14Z

src/runtime/proc.go

+	// Yielded goroutines were runnable but voluntarily deprioritized themselves
+	// to waiting instead by calling BackgroundYield. If we have nothing runnable
+	// we can bring a yielded goroutine back to runnable and run it.
+	if sched.bgqsize != 0 {


Does the go compiler know something special about certain fields and makes their unsynchronized reads "less stale"?

Not that I can find. The runtime and in particular scheduler uses these a-bit-stale reads liberally (though in important things like findRunnable there are locked reads backstopping them after fast paths are exhausted). I was, uh, curious about this too. As far as I can tell -- and what chatgpt tells me as well -- is they're just good old unlocked, un-syned reads. They're uint32 so no worry about tearing a write (on 32b), so staleness is the only concern. They aren't in loops out of which the load might be lifted by the compiler, so it'd be an actual load, and just up to MESI or whatever I guess. Though I did my initial testing with a cruder version of this patch, before I cleaned it up for a PR, and one of my cleanups was to split to up to ensure the cheap checks could be inlined; I wonder if that is a mistake, since if it is inlined into a for loop -- the place we expect this to be called -- maybe the load does get lifted and then never sees new values? I guess I should re-test with the optimized version.

src/runtime/proc.go

src/runtime/backgroundyield_test.go

tbg · 2025-10-10T11:15:11Z

src/runtime/export_test.go

 	}
 }

+// RunBackgroundYieldQueueCheck exercises the background queue enqueue/dequeue


I'm confused by this method. What does it exercise? What does success mean? Why does it return skipped==true if there's something in the bgq? Why is this only used in a test that doesn't seem to do anything?

Can’t use testing.T or testing at all in the runtime package, only in runtime_test external test package. So the common pattern, if you want to test anything non-exported, seems to be a RunX func (in this _test.go file that is not _test package) that exercises the internal code / returns an exported version, then a usually very thin TestX in the external runtime_test package that calls it.

That’s the reason for the overall setup, but I’ll go back and review the actual logic herein in this one: I added these last couple these tests while just chasing coverage % on the train so might be some room for cleanup/commenting here.

src/runtime/proc.go

sumeerbhola · 2025-10-17T16:22:21Z

src/runtime/proc.go

+//
+// At the same time, we *do* want to yield -- that's why we are here -- if there
+// is work waiting for a chance to run. We can balance our preference to give
+// the other P a chance to just run it vs not making it wait too longwith a


nit: long with

src/runtime/proc.go

sumeerbhola · 2025-10-17T16:36:06Z

Given that we are getting closer to merging this, I think we need a plan for how we will maintain this. Specifically the couple of lines of code scattered over various parts in proc.go e.g. we need a list of what cases we need to integrate with the changes, and how to go about finding those cases in the scheduler code.

dt · 2025-10-25T21:52:23Z

the couple of lines of code scattered over various parts in proc.go

Per the discussion thread above and in the google doc, I've cut these "lines scattered around" (assuming we're referring to the same ones) since my current approach is to just do searches of runqs when npidle is zero. This seems to perform about the same, and better handled added a check of netpoll which I realized I wanted anyway, so the diff is now much closer to a pure, more self contained addition: we need two additions to findRunnable, though they don't depend on anything other than the new yieldq, then two new fields: the yieldq in schedt and the yieldchecks counter/timestamp in G. Other than that it's just the pure addition of the Yield() function and its helpers now.

dt · 2025-10-25T21:59:46Z

I definitely want to see experimental evidence of the benefit.

Here's without/with making all Pacer.Pace() calls include a call to runtime.Yield() (even those where Pacer is nil/elastic AC is off) on a 5x8vcpu cluster running a TPCC 5k IMPORT while serving kv95 20k QPS in the foreground (note the units changed in the first one). CPU utilization is smoother and higher with Yields.

Without a foreground workload, IMPORT throughput is higher with Yields than without, somewhat surprisingly (by 5-10%) while it is slightly lower when there is a workload to yield to. CPU profiling indicate that while maintaining 98% CPU utilization, the Yield() calls account for about 0.8-0.9% of IMPORT's CPU usage and when under <90% utilization more like 0.4-0.6%.

petermattis

I like that this iteration is less intrusive than the previous one. Mostly comments about needing more comments.

petermattis · 2025-10-26T16:42:00Z

src/runtime/proc.go

+		return
+	}
+
+	if p := gp.m.p.ptr(); p.runqhead != p.runqtail || p.runnext != 0 {


If I understand correctly, the p.runqhead != p.runqtail || p.runnext != 0 check is we check that the P's local runq is not empty. If that's correct, it's probably worth a comment in this code to make that clear. I imagine the Go runtime folks know this intimately, but if we have to maintain this patch separately we'll appreciate this context.

+1 to a code comment.

petermattis · 2025-10-26T16:48:37Z

src/runtime/proc.go

+			yieldPark()
+			return
+		}
+		// To avoid thrashing between yields, set yieldchecks to 1: if we yield


Can you add some more words about the "thrashing" scenario this is trying to avoid?

petermattis · 2025-10-26T17:01:46Z

src/runtime/proc.go

+	if count := gp.yieldchecks & yieldCountMask; (count & (count + 1)) == 0 {
+		prev := gp.yieldchecks &^ yieldCountMask
+		now := uint32(nanotime()>>yieldEpochShift) &^ yieldCountMask
+		if now != prev || count == yieldCountMask {


This definitely deserves some comments to explain what is going on. What are the trade-offs in the number of count bits and the epoch shift? How were these constants arrived at?

@dt I'll wait to review this part after the code comments are added.

I added a comment with more detail on the combined clock+counter-to-check-clock approach, and the choice of 1/4 ms as a time well smaller than your typical "request" latency for any networked service.

src/runtime/proc.go

petermattis · 2025-10-26T17:08:23Z

src/runtime/proc.go

+			gp.yieldchecks = now
+
+			for i := range allp {
+				// We don't need the extra accuracy (and cost) of runqempty here either.


Is the expense of runqempty problematic? Seems like you're only doing this infrequently.

Not that expensive (unless you have a silly number of cores) at least in relative terms in this branch where we have a whole syscall to netpoll. But I think it makes sense to skip it above when checking the local runq and if we're okay skipping it there, we should be okay skipping it here too (they can all be stolen from concurrently all the same).

petermattis · 2025-10-26T17:11:01Z

The results are very compelling.

sumeerbhola

@sumeerbhola reviewed 1 of 2 files at r3, all commit messages.
Reviewable status: 1 of 5 files reviewed, 25 unresolved discussions (waiting on @dt, @petermattis, and @tbg)

src/runtime/runtime2.go line 515 at r3 (raw file):

	runningnanos  int64 // wall time spent in the running state

	yieldchecks uint32 // a packed approx time and count of maybeYield checks.

This needs a longer code comment elaborating on what exactly this represents and what the packing scheme is.

src/runtime/proc.go line 420 at r3 (raw file):

		// To avoid thrashing between yields, set yieldchecks to 1: if we yield
		// right back and see this sentinel we'll park instead to break the cycle.
		gp.yieldchecks = 1

So sometimes yieldchecks is a packed field and sometimes 1? This needs code comments.

src/runtime/proc.go line 7251 at r3 (raw file):

}

// yield_put is the gopark unlock function for Yield. It enqueues the goroutine

I'm confused by the "unlock function" terminology, given there are callers of gopark that pass nil. And I want to make sure we document in a code comment why this is correct from the perspective of the following comment in gopark:

// unlockf must not access this G's stack, as it may be moved between
// the call to gopark and the call to unlockf.
//
// Note that because unlockf is called after putting the G into a waiting
// state, the G may have already been readied by the time unlockf is called
// unless there is external synchronization preventing the G from being
// readied. If unlockf returns false, it must guarantee that the G cannot be
// externally readied.

We don't access the stack, so fine. Is "readied" means transitioning back to runnable state? I suppose that can't happen either because we haven't even put it in the yieldq yet, so no one can discover it and make it runnable. Anything else I am missing?

src/runtime/proc.go

sumeerbhola · 2025-10-27T22:16:08Z

src/runtime/proc.go

+		return
+	}
+
+	if p := gp.m.p.ptr(); p.runqhead != p.runqtail || p.runnext != 0 {


+1 to a code comment.

sumeerbhola · 2025-10-27T22:16:08Z

src/runtime/proc.go

+			yieldPark()
+			return
+		}
+		// To avoid thrashing between yields, set yieldchecks to 1: if we yield


sumeerbhola · 2025-10-27T22:16:08Z

src/runtime/proc.go

+	if count := gp.yieldchecks & yieldCountMask; (count & (count + 1)) == 0 {
+		prev := gp.yieldchecks &^ yieldCountMask
+		now := uint32(nanotime()>>yieldEpochShift) &^ yieldCountMask
+		if now != prev || count == yieldCountMask {


@dt I'll wait to review this part after the code comments are added.

dt

Reviewable status: 0 of 6 files reviewed, 25 unresolved discussions (waiting on @petermattis, @sumeerbhola, and @tbg)

src/runtime/proc.go line 7251 at r3 (raw file):

Previously, sumeerbhola wrote…

I'm confused by the "unlock function" terminology, given there are callers of gopark that pass nil. And I want to make sure we document in a code comment why this is correct from the perspective of the following comment in gopark:
// unlockf must not access this G's stack, as it may be moved between
// the call to gopark and the call to unlockf.
//
// Note that because unlockf is called after putting the G into a waiting
// state, the G may have already been readied by the time unlockf is called
// unless there is external synchronization preventing the G from being
// readied. If unlockf returns false, it must guarantee that the G cannot be
// externally readied.
We don't access the stack, so fine. Is "readied" means transitioning back to runnable state? I suppose that can't happen either because we haven't even put it in the yieldq yet, so no one can discover it and make it runnable. Anything else I am missing?

Yeah, that's pretty much it: nothing is going to ready this G until findRunnable pull it from the yieldq so it is our to pu there, so I think we can guarantee it isn't externally readied.

src/runtime/proc.go

dt · 2025-10-28T20:28:27Z

Only one data point but I just rebased this from go 1.23 to go 1.25 since CRDB just switched today and the rebase was clean with no conflicts on the scaled back touch points of just findRunnable+two new fields. I think the previous version with the extra pending work atomic did have master/1.25 -> 1.23 conflicts when I went the other way, so maybe some evidence that being more contained is indeed helpful here.

petermattis

This approach looks good to me. I'll defer to @sumeerbhola for final approval. Really excited about the benefits you're seeing in test scenarios.

petermattis · 2025-10-29T14:20:06Z

src/runtime/proc.go

+			// Set yieldchecks to just new high timestamp bits, cleaning counter.
+			gp.yieldchecks = now
+
+			// Check runqs of all Ps; if we find anything park free this P to steal.


Nit: anything parked?

dt requested review from sumeerbhola and tbg October 6, 2025 22:17

petermattis reviewed Oct 8, 2025

View reviewed changes

src/runtime/proc.go Outdated Show resolved Hide resolved

src/runtime/proc.go Outdated Show resolved Hide resolved

src/runtime/proc.go Outdated Show resolved Hide resolved

rickystewart force-pushed the cockroach-go1.23.12 branch 3 times, most recently from 84fef0d to ec86954 Compare October 9, 2025 21:21

sumeerbhola requested changes Oct 9, 2025

View reviewed changes

sumeerbhola reviewed Oct 10, 2025

View reviewed changes

src/runtime/proc.go Show resolved Hide resolved

tbg reviewed Oct 10, 2025

View reviewed changes

src/runtime/backgroundyield_test.go Outdated Show resolved Hide resolved

tbg reviewed Oct 10, 2025

View reviewed changes

dt force-pushed the yield branch 5 times, most recently from e673937 to 642d058 Compare October 16, 2025 02:33

dt changed the title ~~runtime: add runtime.BackgroundYield()~~ runtime: add runtime.Yield() Oct 16, 2025

sumeerbhola requested changes Oct 17, 2025

View reviewed changes

dt force-pushed the yield branch 2 times, most recently from a642b54 to 4a07233 Compare October 25, 2025 21:45

petermattis reviewed Oct 26, 2025

View reviewed changes

sumeerbhola requested changes Oct 27, 2025

View reviewed changes

dt force-pushed the yield branch 5 times, most recently from 891a05a to 1079405 Compare October 28, 2025 03:58

dt commented Oct 28, 2025

View reviewed changes

petermattis reviewed Oct 28, 2025

View reviewed changes

src/runtime/proc.go Show resolved Hide resolved

src/runtime/proc.go Show resolved Hide resolved

dt force-pushed the yield branch from 1079405 to 871cd2f Compare October 28, 2025 20:21

runtime: add runtime.Yield

83b02b4

dt force-pushed the yield branch from 871cd2f to 83b02b4 Compare October 28, 2025 20:23

dt changed the base branch from cockroach-go1.23.12 to cockroach-go1.25.3 October 28, 2025 20:23

petermattis reviewed Oct 29, 2025

View reviewed changes

runtime: add runtime.Yield() #8

Are you sure you want to change the base?

runtime: add runtime.Yield() #8

Uh oh!

Conversation

dt commented Oct 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petermattis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sumeerbhola commented Oct 17, 2025

Uh oh!

dt commented Oct 25, 2025

Uh oh!

dt commented Oct 25, 2025

Uh oh!

petermattis left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dt Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

petermattis commented Oct 26, 2025

Uh oh!

sumeerbhola left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dt commented Oct 6, 2025 •

edited

Loading

dt Oct 28, 2025 •

edited

Loading